Microbiome data integration workflow for population cohort
studies
# Get starting time
starting_time <- Sys.time()
################################################################################
# List of packages that we need
packages <- c("ggplot2", "mia", "miaViz", "dplyr", "tidyr", "scater",
"ComplexHeatmap", "knitr")
# Get packages that are already installed installed
packages_already_installed <- packages[ packages %in% installed.packages() ]
# Get packages that need to be installed
packages_need_to_install <- setdiff( packages, packages_already_installed )
# Loads BiocManager into the session. Install it if it not already installed.
if( !require("BiocManager") ){
install.packages("BiocManager")
library("BiocManager")
}
# If there are packages that need to be installed, installs them with BiocManager
# Updates old packages.
if( length(packages_need_to_install) > 0 ) {
install(packages_need_to_install, ask = FALSE)
}
# Load all packages into session. Stop if there are packages that were not
# successfully loaded
if( any(!sapply(packages, require, character.only = TRUE)) ){
stop("Error in loading packages into the session.")
}
################################################################################
# Additional setup
# Set black and white theme for figures, and Arial font
theme <- theme_bw() + theme(text = element_text(family = "Arial"),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))
theme_set(theme)
All authors are affiliated to Turku Data Science Group in
University of Turku, Finland.
# Plot publication graph
path <- "data/PubMed_Timeline_Results_by_Year.csv"
df <- read.csv(path, skip = 1)
x <- "Year"
y <- "Count"
plot <- ggplot(df, aes(x = .data[[x]], y = .data[[y]])) +
geom_bar(stat="identity")
plot
PubMed publications per year with a search term ‘microbiome’ (fetched: Sep 5, 2023)
Regarding the choice of the data strcuture to handle cohorts with muliple data types or sources MultiAssayExperiment, is considered to be very liable in such situation, easing-up the management and wrangling of the data; once is construced correctly. Moreover, several R packages frameworks are increasingly integrating MultiAssayExperiment and the SummarizedExperimentclass, to provide users with a reliable and ease of use data structure.
We get the data from MGnify database. It is a EMBL-EBI’s database for metagenomic data. This large microbiome database can be accessed with MGnifyR package which nowadays support TreeSE. The package will be submitted to Bioconductor’s next release.
We chose this dataset…
As loading takes some time, the dataset is already loaded.
For other available datasets and importing methods, check OMA.
# library(MGnifyR)
# mg <- MgnifyClient()
#
# analyses <- searchAnalysis(mg, "studies", "MGYS00005128")
# analyses <- searchAnalysis(mg, "studies", "MGYS00000596")
# mae <- getResult(mg, analyses)
library(mia)
data("HintikkaXOData")
MAE <- HintikkaXOData
MAE
## A MultiAssayExperiment object of 3 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 3:
## [1] microbiota: TreeSummarizedExperiment with 12706 rows and 40 columns
## [2] metabolites: TreeSummarizedExperiment with 38 rows and 40 columns
## [3] biomarkers: TreeSummarizedExperiment with 39 rows and 40 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
As we can see, our data contains three different types of data: microbiota, metabolites and biomarkers; which are all of the TreeSummarizedExperiment class combined in ourMultiAssayExperiment object.
Microbiota:
MAE[["microbiota"]] # or MAE[[1]]
## class: TreeSummarizedExperiment
## dim: 12706 40
## metadata(0):
## assays(1): counts
## rownames(12706): GAYR01026362.62.2014 CVJT01000011.50.2173 ... JRJTB:03787:02429 JRJTB:03787:02478
## rowData names(7): Phylum Class ... Species OTU
## colnames(40): C1 C2 ... C39 C40
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
Metabolites:
MAE[["metabolites"]] # or MAE[[2]]
## class: TreeSummarizedExperiment
## dim: 38 40
## metadata(0):
## assays(1): nmr
## rownames(38): Butyrate Acetate ... Malonate 1,3-dihydroxyacetone
## rowData names(0):
## colnames(40): C1 C2 ... C39 C40
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
Biomarkers:
MAE[["biomarkers"]] # or MAE[[3]]
## class: TreeSummarizedExperiment
## dim: 39 40
## metadata(0):
## assays(1): signals
## rownames(39): Triglycerides_liver CLSs_epi ... NPY_serum Glycogen_liver
## rowData names(0):
## colnames(40): C1 C2 ... C39 C40
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
Additionally, the MAE contains phenotype or metadata about samples or subjects, that could be accessed as follows:
colData(MAE) %>% head %>% knitr::kable()
| Sample | Rat | Site | Diet | Fat | XOS | |
|---|---|---|---|---|---|---|
| C1 | C1 | 1 | Cecum | High-fat | High | 0 |
| C2 | C2 | 2 | Cecum | High-fat | High | 0 |
| C3 | C3 | 3 | Cecum | High-fat | High | 0 |
| C4 | C4 | 4 | Cecum | High-fat | High | 0 |
| C5 | C5 | 5 | Cecum | High-fat | High | 0 |
| C6 | C6 | 6 | Cecum | High-fat | High | 0 |
library(dplyr)
rowData(MAE[[1]]) %>% as_tibble() %>% summarise_all(n_distinct) %>%
knitr::kable()
| Phylum | Class | Order | Family | Genus | Species | OTU |
|---|---|---|---|---|---|---|
| 14 | 23 | 43 | 74 | 225 | 319 | 12706 |
library(miaViz)
# we agglomerate the data from OTUs to species,
# and store it at the alternative experiment slot
altExp(MAE[[1]], "Species") <- agglomerateByRank(MAE[[1]], rank="Species",
onRankOnly=TRUE, na.rm = FALSE)
# we transform data from counts to relative abundance
altExp(MAE[[1]], "Species") <- transformAssay(altExp(MAE[[1]], "Species"),
assay.type = "counts",
method = "relabundance")
plotAbundanceDensity(altExp(MAE[[1]], "Species"), assay.type = "relabundance",
n=10L, point_size=0.5, layout="density") +
scale_x_log10() +
labs(title = "Top 10 Species", x="log10(relabundance) ")+
theme(plot.title = element_text(hjust = 0.5, size = 14),
text=element_text(size = 12))
# we agglomerate the data from OTUs to species,
# and store it at the alternative experiment slot
altExp(MAE[[1]], "Genus") <- agglomerateByRank(MAE[[1]], rank="Genus",
onRankOnly=TRUE, na.rm = FALSE)
# we transform data from counts to relative abundance
altExp(MAE[[1]], "Genus") <- transformAssay(altExp(MAE[[1]], "Genus"),
assay.type = "counts",
method = "relabundance")
plotAbundanceDensity(altExp(MAE[[1]], "Genus"), assay.type = "relabundance",
n=10L, point_size=0.5, layout="density") +
scale_x_log10() +
labs(title = "Top 10 Genus", x="log10(relabundance) ")+
theme(plot.title = element_text(hjust = 0.5, size = 14),
text=element_text(size = 12))
Lets look at ShannonIindex and Observed Richness, where the results
are stored back automatically at the colData of the MAE
microbiome Species alternative experiment:
altExp(MAE[[1]], "Species") <- estimateDiversity(altExp(MAE[[1]], "Species"),
index="shannon",
assay.type="counts")
altExp(MAE[[1]], "Species") <- estimateRichness(altExp(MAE[[1]], "Species"),
index="observed",
assay.type="counts")
colData(altExp(MAE[[1]], "Species")) %>% head() %>% knitr::kable()
| shannon | observed | |
|---|---|---|
| C1 | 1.685654 | 79 |
| C2 | 1.368910 | 55 |
| C3 | 1.101818 | 69 |
| C4 | 1.182855 | 69 |
| C5 | 1.189534 | 61 |
| C6 | 1.368236 | 61 |
Let’s look at the histogram of both indices:
colData(altExp(MAE[[1]], "Species")) %>% as_tibble() %>%
tidyr::pivot_longer(cols=c(shannon,observed), names_to = "alpha_type",
values_to = "estimate") %>%
ggplot(aes(x=estimate)) +
geom_histogram(color = "black", fill = "gray", bins = 30)+
facet_wrap(~alpha_type, nrow = 1, scales = "free")+
labs(title = "Alpha Diversity")+
theme(plot.title = element_text(hjust = 0.5))
Let’s look at beta diversity by performing Principal Coordinate Analysis (PCoA) using Bray-Curtis as a distance method on the OTU level data:
library(scater)
MAE[[1]] <- runMDS(MAE[[1]], FUN = vegan::vegdist, name = "PCoA_BC",
assay.type = "counts")
# Calculating explained variance
e <- attr(reducedDim(MAE[[1]], "PCoA_BC"), "eig");
rel_eig <- e/sum(e[e>0])
# Ploting the ordination
plotReducedDim(MAE[[1]], "PCoA_BC") +
labs(x = paste("PC1 (", round(100 * rel_eig[[1]],1), "%", ")", sep = ""),
y = paste("PC2 (", round(100 * rel_eig[[2]],1), "%", ")", sep = ""),
title="PCoA with Bray-Curtis")+
theme(plot.title = element_text(hjust = 0.5))
With could visualize the same plot using a metadata varible (from colData), to overlay it on our PCoA plot:
# we first assign our colData from MAE to MAE [[1]]
colData(MAE[[1]]) <- colData(MAE)[colnames(MAE[[1]]),]
plotReducedDim(MAE[[1]], "PCoA_BC", colour_by="Fat") +
labs(x = paste("PC1 (", round(100 * rel_eig[[1]],1), "%", ")", sep = ""),
y = paste("PC2 (", round(100 * rel_eig[[2]],1), "%", ")", sep = ""),
title="PCoA with Bray-Curtis")+
theme(plot.title = element_text(hjust = 0.5))
As an example of analyzing different Omics data stored at the MAE, let’s look at the cross-correlation between microbiome Species and metabolite data:
# First we transform our microbiome Species data with log10
altExp(MAE[[1]], "Species") <- transformAssay(altExp(MAE[[1]], "Species"),
assay.type = "counts",
method = "log10", pseudocount = 1)
# Then we cross-correlae our microbiome data with the metabolites
correlations <- testExperimentCrossCorrelation(MAE,
altexp1="Species",
experiment2 = 2,
assay.type1 = "log10",
assay.type2 = "nmr",
method = "spearman",
p_adj_threshold = NULL,
cor_threshold = NULL,
# Remove when mia is fixed
mode = "matrix",
sort = TRUE,
show_warnings = FALSE)
Then with the help of the “ComplexHeatmap” package we visualize the results as a heatmap:
library(ComplexHeatmap)
plot <- Heatmap(correlations$cor,
# Print values to cells
cell_fun = function(j, i, x, y, width, height, fill) {
# If the p-value is under threshold
if( !is.na(correlations$p_adj[i, j]) & correlations$p_adj[i, j] < 0.05 ){
# Print "X"
grid.text(sprintf("%s", "X"), x, y, gp = gpar(fontsize = 8, col = "#1dff00"))
}
},
heatmap_legend_param = list(title = "", legend_height = unit(5, "cm")),
column_names_gp = grid::gpar(fontsize = 8),
row_names_gp = grid::gpar(fontsize = 8),
column_names_centered=TRUE
)
plot
Once again we could look at the cross-correlation between microbiome Species and biomarkers data:
# Then we cross-correlae our microbiome data with the biomarkers
correlations <- testExperimentCrossCorrelation(MAE,
altexp1="Species",
experiment2 = 3,
assay.type1 = "log10",
assay.type2 = "signals",
method = "spearman",
p_adj_threshold = NULL,
cor_threshold = NULL,
# Remove when mia is fixed
mode = "matrix",
sort = TRUE,
show_warnings = FALSE)
The resulting heatmap:
library(ComplexHeatmap)
plot <- Heatmap(correlations$cor,
# Print values to cells
cell_fun = function(j, i, x, y, width, height, fill) {
# If the p-value is under threshold
if( !is.na(correlations$p_adj[i, j]) & correlations$p_adj[i, j] < 0.05 ){
# Print "X"
grid.text(sprintf("%s", "X"), x, y, gp = gpar(fontsize = 8, col = "#1dff00"))
}
},
heatmap_legend_param = list(title = "", legend_height = unit(5, "cm")),
column_names_gp = grid::gpar(fontsize = 8),
row_names_gp = grid::gpar(fontsize = 8),
column_names_centered=TRUE
)
plot
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so; LAPACK version 3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Helsinki
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.43 ComplexHeatmap_2.17.0 scater_1.29.3 scuttle_1.11.2 tidyr_1.3.0
## [6] dplyr_1.1.2 miaViz_1.9.0 ggraph_2.1.0 ggplot2_3.4.2 BiocManager_1.30.21.1
## [11] mia_1.9.9 MultiAssayExperiment_1.27.0 TreeSummarizedExperiment_2.9.0 Biostrings_2.69.2 XVector_0.41.1
## [16] SingleCellExperiment_1.23.0 SummarizedExperiment_1.31.1 Biobase_2.61.0 GenomicRanges_1.53.1 GenomeInfoDb_1.37.2
## [21] IRanges_2.35.2 S4Vectors_0.39.1 BiocGenerics_0.47.0 MatrixGenerics_1.13.1 matrixStats_1.0.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.3.0 bitops_1.0-7 ggplotify_0.1.1 tibble_3.2.1 polyclip_1.10-4 DirichletMultinomial_1.43.0
## [7] lifecycle_1.0.3 rstatix_0.7.2 doParallel_1.0.17 lattice_0.21-8 MASS_7.3-60 backports_1.4.1
## [13] magrittr_2.0.3 sass_0.4.7 rmarkdown_2.23 jquerylib_0.1.4 yaml_2.3.7 cowplot_1.1.1
## [19] DBI_1.1.3 RColorBrewer_1.1-3 abind_1.4-5 zlibbioc_1.47.0 purrr_1.0.1 RCurl_1.98-1.12
## [25] yulab.utils_0.0.6 tweenr_2.0.2 circlize_0.4.15 GenomeInfoDbData_1.2.10 ggrepel_0.9.3 irlba_2.3.5.1
## [31] tidytree_0.4.4 vegan_2.6-4 permute_0.9-7 DelayedMatrixStats_1.23.0 codetools_0.2-19 DelayedArray_0.27.10
## [37] DT_0.28 ggforce_0.4.1 tidyselect_1.2.0 shape_1.4.6 aplot_0.1.10 farver_2.1.1
## [43] ScaledMatrix_1.9.1 viridis_0.6.4 jsonlite_1.8.7 GetoptLong_1.0.5 BiocNeighbors_1.19.0 decontam_1.21.0
## [49] tidygraph_1.2.3 iterators_1.0.14 foreach_1.5.2 tools_4.3.0 ggnewscale_0.4.9 treeio_1.25.2
## [55] Rcpp_1.0.11 glue_1.6.2 gridExtra_2.3 SparseArray_1.1.11 BiocBaseUtils_1.3.2 xfun_0.39
## [61] mgcv_1.9-0 withr_2.5.0 fastmap_1.1.1 bluster_1.11.4 fansi_1.0.4 digest_0.6.33
## [67] rsvd_1.0.5 R6_2.5.1 gridGraphics_0.5-1 colorspace_2.1-0 Cairo_1.6-0 RSQLite_2.3.1
## [73] utf8_1.2.3 generics_0.1.3 DECIPHER_2.29.0 graphlayouts_1.0.0 htmlwidgets_1.6.2 S4Arrays_1.1.5
## [79] pkgconfig_2.0.3 gtable_0.3.3 blob_1.2.4 htmltools_0.5.5 carData_3.0-5 clue_0.3-64
## [85] scales_1.2.1 png_0.1-8 ggfun_0.1.1 rstudioapi_0.15.0 reshape2_1.4.4 rjson_0.2.21
## [91] nlme_3.1-162 cachem_1.0.8 GlobalOptions_0.1.2 stringr_1.5.0 parallel_4.3.0 vipor_0.4.5
## [97] pillar_1.9.0 vctrs_0.6.3 ggpubr_0.6.0 car_3.1-2 BiocSingular_1.17.1 beachmat_2.17.15
## [103] cluster_2.1.4 beeswarm_0.4.0 evaluate_0.21 magick_2.7.5 cli_3.6.1 compiler_4.3.0
## [109] rlang_1.1.1 crayon_1.5.2 ggsignif_0.6.4 labeling_0.4.2 plyr_1.8.8 ggbeeswarm_0.7.2
## [115] stringi_1.7.12 viridisLite_0.4.2 BiocParallel_1.35.3 munsell_0.5.0 lazyeval_0.2.2 Matrix_1.6-0
## [121] patchwork_1.1.2 sparseMatrixStats_1.13.0 bit64_4.0.5 highr_0.10 igraph_1.5.0.1 broom_1.0.5
## [127] memoise_2.0.1 bslib_0.5.0 ggtree_3.9.0 bit_4.0.5 ape_5.7-1